A short description of the post.
In January 2014, the leaders of GAStech are celebrating their new-found fortune as a result of the initial public offering of their very successful company. In the midst of this celebration, several employees of GAStech go missing.
In Mini-Challenge 2,we will analysis GAStech’s company cars gps records, credit card transactions and loyalty card usage data. From this data, we will identify anomalies and suspicious behaviors and identify which people use which credit and loyalty cards. All questions will be responsed by using visual analysis.
The same dataset have been used in 2014 VAST Challenge
The packages using for this Assignment shows below:
| No. | Packages | Function |
|---|---|---|
| 1 | tidyverse | Data cleaning and manipulation |
| 2 | ggplot2 | ggplot2 is a system for declaratively creating graphics |
| 3 | ggiraph | ggiraph is a tool that allows you to create dynamic ggplot graphs. |
| 4 | lubridate,clock |
packages = c('tmap', 'clock',
'tidyverse','ggplot2','ggiraph',
'lubridate','ggthemes','viridis','plotly','treemapify','sf',
'raster','readr','tmap','mapview'
)
for (p in packages){
if(!require(p, character.only = T)){
install.packages(p)
}
library(p,character.only = T)
}
Rows: 1,490
Columns: 4
$ timestamp <chr> "01/06/2014 07:28", "01/06/2014 07:34", "01/06/20~
$ location <chr> "Brew've Been Served", "Hallowed Grounds", "Brew'~
$ price <dbl> 11.34, 52.22, 8.33, 16.72, 4.24, 4.17, 28.73, 9.6~
$ last4ccnum <int> 4795, 7108, 6816, 9617, 7384, 5368, 7253, 4948, 9~
glimpse(loyalty_data)
Rows: 1,392
Columns: 4
$ timestamp <chr> "01/06/2014", "01/06/2014", "01/06/2014", "01/06/~
$ location <chr> "Brew've Been Served", "Brew've Been Served", "Ha~
$ price <dbl> 4.17, 9.60, 16.53, 11.51, 12.93, 4.27, 11.20, 15.~
$ loyaltynum <chr> "L2247", "L9406", "L8328", "L6417", "L1107", "L40~
Before any analysis, we first prepared cc_data and loyalty card. We find timestamp is treated as “Character”" data type instead of date data type. We correct this by using lubridate package. Then separate timestamp into day, weekday for further analysis. Change the “Last4num” datatype from int to chr. Then calculate the total price by location, card number ,card number and locations, respectively.
cc_data$dmy_hm <- date_time_parse(cc_data$timestamp,
zone = "",
format = "%m/%d/%Y %H:%M")
cc_data$weekday <- wday(cc_data$dmy_hm,
label = TRUE,
abbr = FALSE)
cc_data$hour <- hour(cc_data$dmy_hm)
cc_data$day <- day(cc_data$dmy_hm)
cc_data$last4ccnum <- as.character(cc_data$last4ccnum)
cc_data <- cc_data %>%
group_by(location) %>%
mutate(total_price_location = sum(price)) %>%
ungroup()
cc_data <- cc_data %>%
group_by(last4ccnum,location) %>%
mutate(total_price_loccnum = sum(price)) %>%
ungroup()
cc_data <- cc_data %>%
group_by(last4ccnum) %>%
mutate(total_price_ccnum = sum(price)) %>%
ungroup()
loyalty_data$dmy<- date_time_parse(loyalty_data$timestamp,
zone = "",
format = "%m/%d/%Y")
loyalty_data$weekday <- wday(loyalty_data$dmy,
label = TRUE,
abbr = FALSE)
loyalty_data$day <- day(loyalty_data$dmy)
loyalty_data <- loyalty_data %>%
group_by(location) %>%
mutate(total_price_location = sum(price)) %>%
ungroup()
loyalty_data <- loyalty_data %>%
group_by(loyaltynum,location) %>%
mutate(total_price_lolynum = sum(price)) %>%
ungroup()
loyalty_data <- loyalty_data %>%
group_by(loyaltynum) %>%
mutate(total_price_lynum = sum(price)) %>%
ungroup()
car$dmy_hm <- date_time_parse(car$Timestamp,
zone = "",
format = "%m/%d/%Y %H:%M:%S")
car$day <- day(car$dmy_hm)
car$hour <- hour(car$dmy_hm)
First, we look at cc card and loyalty card separately. For all location, we segment them into four groups: Business Dinning Living and Unknown. Through this bar chart we observed that the most popular place for each group is Katerina’s Café, Hippokampos, Abila Airport and Ouzeri Elian. Overall Katerina’s Café is the most popular place and have been visit more than 200 times. Because of the relationship between credit card and loyalty card, loyalty card records give us the same result.
cc_bar_chart <- cc_data %>%
group_by(location) %>%
summarise(n = n()) %>%
ungroup()
oldvalues <- c("Abila Airport","Abila Scrapyard","Abila Zacharo",
"Ahaggo Museum","Albert's Fine Clothing",
"Bean There Done That","Brew've Been Served",
"Brewed Awakenings","Carlyle Chemical Inc.",
"Chostus Hotel","Coffee Cameleon","Coffee Shack",
"Desafio Golf Course","Frank's Fuel",
"Frydos Autosupply n' More","Gelatogalore",
"General Grocer","Guy's Gyros","Hallowed Grounds",
"Hippokampos","Jack's Magical Beans","Kalami Kafenion",
"Katerina’s Café","Kronos Mart","Kronos Pipe and Irrigation",
"Maximum Iron and Steel","Nationwide Refinery",
"Octavio's Office Supplies","Ouzeri Elian",
"Roberts and Sons","Shoppers' Delight",
"Stewart and Sons Fabrication","U-Pump")
newvalues <- factor(c("Business","Business","Unknown",
"Living","Living","Unknown","Dinning",
"Unknown","Business","Living","Dinning",
"Dinning","Living","Unknown","Unknown",
"Dinning","Living","Dinning","Dinning",
"Living","Living","Unknown","Dinning",
"Living","Business","Business","Business",
"Business","Unknown","Business","Living",
"Business","Unknown"
)) # Make this a factor
cc_bar_chart$type <- newvalues[ match(cc_bar_chart$location, oldvalues) ]
p <- ggplot(cc_bar_chart,aes(x = reorder(location,-n),
y = n,fill = type)) +
geom_col(color="black") +
theme(axis.text.x = element_text(angle = 90,
hjust = 1,
vjust = .5))
p <- p + ggtitle("Frequnecy of Credit Card by Location") +
labs(y="Frequency", x = "Location")
ggplotly(p)
Then we plot heatmaps by weekday and hour of each day for the most popular location. Abila Airport’s records mainly in workday between 8:00 to 16:00. Th popular time of Katerina’s Café and Hippokampos in 12:00 to 13:00 and 18:00 and 20:00. Those are the time period when employee have their lunch break or off office. The most popular time for Abila Zacharo is 12:00-13:00.
plot <- cc_data%>%
group_by(location,weekday,hour,) %>%
summarise(n = n()) %>%
ungroup()
full_df <- expand.grid(location = unique(cc_data$location),
weekday = unique(cc_data$weekday),
hour = c(1:23))
plot_full <- plot %>%
right_join(
full_df,
by = c('location','weekday','hour')
) %>%
mutate(hour = as.ordered(hour)) %>%
replace_na(list(n = 0L)) %>%
filter(location %in% c("Katerina’s Café","Hippokampos",
"Abila Airport","Abila Zacharo"))
pwk<- ggplot(plot_full, aes(hour, weekday, fill = n)) +
geom_tile(color = 'white',size = 0.1) +
facet_grid(~location) +
scale_fill_distiller(palette = "Reds",direction = 1)
pwk

One Anomaly is credit card record and loyalty card record not match. According to the background information, loyalty cards records are collecting from employee’s credit card purchases and preferences records. Loyalty card records should match credit card records. However, cc_data has total 1490 rows and loyalty_data only include 1392 rows. This phenomenon hints us for some transaction, customer didn’t use loyalty card for discounts and rewards. Further dig the data, we also find, on the opposite, some customer using loyalty card but didn’t have credit card records. This kind of record happened in Stewart and Sons Fabrication and Carlyle Chemical Inc. One possible reason is those transaction is not paid by credit card.
{picture}
One way to correct this is to find a whole transaction records by join cc_data and loyalty_data on Location, Price and Date. Because timestamp in cc_data have detailed time when transaction happened, some records happened in same location, price and data may have error. After manual remove the mismatched information, we get a new dataset include 1490 rows.
The box plot blow shows us several information. First, Top locations of total transaction price are business relative place or company. Second, some locations have transaction records extremely higher than normal.
ggplot(cc_data, aes(x=price, y=reorder(location,total_price_location))) +
geom_boxplot(outlier.colour="tan1")

Dig deeper into those transaction’s card number and price spend. We find some suspicious phenomenon: 9551 13th $10000 extremely large amount of purchase compare with other purchase records happened in Frydos Autosupply n’ More and 9551’s other transaction. This is the only record for 9551 in this location.
5010 18th $600 extremely large amount of purchase compare with other purchase records happened in Chostus Hotel. This is the only record for 5010 in this location.Date is close to when kidnap happened.
1321 17th $1239 extremely large amount of purchase compare with other purchase records happened in Albert’s Fine Clothing and 1321’s other transaction. This is the only record for 1321 in this location. Date is close to when kidnap happened.
Below heatmap shows total transaction price by location by credit card number. Next step, we will add gps data in further explore the suspicious phenomenon and find the owner of credit card.
ggplot(cc_data,aes(x = last4ccnum,y = location,fill = total_price_ccnum))+
geom_tile(color = 'white',size = 0.1)+
scale_fill_distiller(palette = "Green",direction = 1)+
theme(axis.text.x = element_text(angle = 90,
hjust = 1,
vjust = .5))

Looking at gps data only, first we find the rush period is Mon-Fri morning 7:00 to 8:00, Noon 12:00 to 13:00 and afternoon 17:00 to 18:00. We also find there are unusual gps records at 3am in 7th ,9th ,11th and 14th.
car_data <- car %>%
group_by(day,hour) %>%
summarise(n = n()) %>%
ungroup()
full_df_car <- expand.grid(day = unique(car$day),
hour = c(1:23))
plot_full_car <- car_data %>%
right_join(
full_df_car,
by = c('day','hour')
) %>%
mutate(hour = as.ordered(hour)) %>%
mutate(day = as.ordered(day)) %>%
replace_na(list(n = 0L))
ggplot(plot_full_car, aes(hour,day,fill = n)) +
geom_tile(color = 'white',size = 0.1) +
scale_fill_distiller(palette = "Green",direction = 1)

car$id <- as.factor(car$id)
car_data <- car %>%
group_by(id,day) %>%
summarise(n = n()) %>%
ungroup()
ggplot(car_data,aes(x = day,y = id,fill = n)) + geom_tile()+
scale_fill_distiller(palette = "blue",direction = 1)

car$id <- as.character(car$id)
car_data <- car %>%
group_by(id,hour) %>%
summarise(n = n()) %>%
ungroup()
ggplot(car_data,aes(x = hour,y = id,fill = n)) + geom_tile()+
scale_fill_distiller(palette = "Blue",direction = 1)

In this step, we input map and clear the noise gps data. Group gps data by CarID, day and hour, then combine cc_card transaction information and car moving route to locate owner of card and activities they may involve in.
bgmap <- raster('D:/ISSS608 Data Visualization/MC2/MC2/Geospatial/MC2-tourist.tif')
bgmap
class : RasterLayer
band : 1 (of 3 bands)
dimensions : 1595, 2706, 4316070 (nrow, ncol, ncell)
resolution : 3.16216e-05, 3.16216e-05 (x, y)
extent : 24.82419, 24.90976, 36.04499, 36.09543 (xmin, xmax, ymin, ymax)
crs : +proj=longlat +datum=WGS84 +no_defs
source : MC2-tourist.tif
names : MC2.tourist
values : 0, 255 (min, max)
tmap_mode('plot')
tm_shape(bgmap) +
tm_raster(bgmap,
legend.show = FALSE)
tmap_mode('view')
tm_shape(bgmap) +
tm_rgb(bgmap, r = 1,g = 2,b = 3,
alpha = NA,
saturation = 1,
interpolate = TRUE,
max.value = 255)
Ablia_st <- st_read(dsn = "D:/ISSS608 Data Visualization/MC2/MC2/Geospatial",
layer = 'Abila')
Reading layer `Abila' from data source `D:\ISSS608 Data Visualization\MC2\MC2\Geospatial' using driver `ESRI Shapefile'
Simple feature collection with 3290 features and 9 fields
geometry type: LINESTRING
dimension: XY
bbox: xmin: 24.82401 ymin: 36.04502 xmax: 24.90997 ymax: 36.09492
geographic CRS: WGS 84
gps <- read_csv("D:/ISSS608 Data Visualization/MC2/MC2/gps.csv")
glimpse(gps)
Rows: 685,169
Columns: 4
$ Timestamp <chr> "01/06/2014 06:28:01", "01/06/2014 06:28:01", "01/~
$ id <dbl> 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35~
$ lat <dbl> 36.07623, 36.07622, 36.07621, 36.07622, 36.07621, ~
$ long <dbl> 24.87469, 24.87460, 24.87444, 24.87425, 24.87417, ~
gps$Timestamp <- date_time_parse(gps$Timestamp,
zone = "",
format = "%m/%d/%Y %H:%M:%S")
gps$day <- as.factor(get_day(gps$Timestamp))
gps$id <- as_factor(gps$id)
gps$hour <- as.factor(get_hour(gps$Timestamp))
gps_sf <- st_as_sf(gps,
coords = c("long","lat"),
crs = 4326)
gps_path <- gps_sf %>%
group_by(id, day,hour) %>%
summarize(m = mean(Timestamp),
do_union=FALSE) %>%
st_cast("LINESTRING")
p = npts(gps_path, by_feature = TRUE)
gps_path2 <- cbind(gps_path, p)
gps_path <- gps_path2 %>%
filter(p > 1)
All the movement at 3am are belong to Security department and they are Bodrogi Loreto(15), Mies Minke(24), Osvaldo Hennie(16), Osvaldo Hennie(21). This make this sign more suspicious. (pirctur)
CarID 15 Bodrogi Loreto from last record on 6th is at 17:00 from GAScompany to home(?) however, He move at next 3am to near Speston Park and stay there for whole night then directly went to work next day. Similar path in 9th 3am, he went to place near Taxiarchan Park.
gps_path_150617 <- gps_path %>%
filter(id == 15,day == 6,hour ==17)
tmap_mode("view")
tm_shape(bgmap) +
tm_rgb(bgmap, r = 1,g = 2,b = 3,
alpha = NA,
saturation = 1,
interpolate = TRUE,
max.value = 255) +
tm_shape(gps_path_150617) + tm_lines()
gps_path_150703 <- gps_path %>%
filter(id == 15,day == 7,hour ==3)
tmap_mode("view")
tm_shape(bgmap) +
tm_rgb(bgmap, r = 1,g = 2,b = 3,
alpha = NA,
saturation = 1,
interpolate = TRUE,
max.value = 255) +
tm_shape(gps_path_150703) + tm_lines()
gps_path_150903 <- gps_path %>%
filter(id == 15,day == 9,hour == 3)
tmap_mode("view")
tm_shape(bgmap) +
tm_rgb(bgmap, r = 1,g = 2,b = 3,
alpha = NA,
saturation = 1,
interpolate = TRUE,
max.value = 255) +
tm_shape(gps_path_150903) + tm_lines()
Mies Minke(24) also have movement in 9th and we find he went to same place near Taxiarchan Park as CarID 15. Hence, there is a high possibility they meet secretly.
gps_path_240903 <- gps_path %>%
filter(id == 24,day == 9,hour == 3)
tmap_mode("view")
tm_shape(bgmap) +
tm_rgb(bgmap, r = 1,g = 2,b = 3,
alpha = NA,
saturation = 1,
interpolate = TRUE,
max.value = 255) +
tm_shape(gps_path_240903) + tm_lines()
Vann Isia(16) and Osvaldo Hennie(21) met in 11th midnight. the activity this similar to what happened with Bodrogi Loreto and Mies Minke. We have reason to doubt Security Department’s Perimeter Controls were planning something.
Osvaldo Hennie(21) left until 11:00 next day and we miss the data Vann Isia(16)’s gps data from 3:00 to 18:00. Hence, we cannot located where she might go.
gps_path_161103 <- gps_path %>%
filter(id == 16,day == 11,hour == 3)
tmap_mode("view")
tm_shape(bgmap) +
tm_rgb(bgmap, r = 1,g = 2,b = 3,
alpha = NA,
saturation = 1,
interpolate = TRUE,
max.value = 255) +
tm_shape(gps_path_161103) + tm_lines()
Osvaldo Hennie(21) have same path like 11th went to temple at 3:00 and this time met with Mies Minke(24)
In the conclusion, We have reason to doubt Security Department’s Perimeter Controls were planning something. Bodrogi Loreto(15), Mies Minke(24), Osvaldo Hennie(16), Osvaldo Hennie(21) all have path met near temple and stay for whole night.